ABSTRACT 

A statistical procedure is presented that is designed 
to test for unidirectional test bias existing simultaneously in 
several items of an ability test, based on the assumption that test 
bias is incipient within the two groups' ability differences. The 
proposed procedure — Simultaneous Item Bias (SIB) — is based on a 
multidimensional item response theory (IRT) approach. SIB 
statistically tests for bias in one or more items at a time, and is 
corrected for the inflation (or deflation) of the test statistic due 
to target ability difference, a valid group difference that is 
conceptually independent of psychological test bias. The correction 
plays the same role as does the practice of including the single 
studied item in the matching criterion score in the Mantel-Haenszel 
(MH) procedure that is adapted for test responses. It is shown 
through the initial portion of an extensive simulation study in 
progress with 84 cases that, with the correction in place, the 
procedure performs as well as does the MH procedure in many cases 
when there is a single biased item, and it performs well in the case 
of multiple item test bias. Twelve tables present data from the 
simulation, and four graphs illustrate study findings. A 20-item list 
of references is included. (Author/SLD) 
ABSTRACT 



This paper presents a statistical procedure (denoted by SIB) designed to test for uni- 
directional test bias existing simultaneously in several items of an ability test. It was 
argued in Shealy and Stout (1991) that in order to model such bias with an IRT model, a 
multidimensional model is necessary. The proposed procedure, based on this multidimen- 
sional IRT modeling approach, statistically tests for bias in one or more items at a time 
and is corrected for the inflation (or deflation) of the test statistic due to target ability 
difference, a valid group difference that is conceptually independent of psychological test 
bias. The correction plays the same role as the practice of including the single studied 
item in the "matching criterion" score in the Mantel-Haenszel (MH) procedure adapted 
for test responses by Holland and Thayer (1988). It is shown through the initial portion of 
an extensive simulation study underway (Shealy (1991)) that, with the correction in place, 
the procedure performs as well as the MH procedure in many cases when there is a single 
biased item, and performs well in the case of multiple item test bias. 



Key Words: item bias, test bias, DIF, latent trait theory, item response theory, target abil- 
ity, valid subtest, nuisance determinants, potential for bias, expressed bias, unidirectional 
test bias, bidirectional test bias, SIB, Mantel-Haenszel. 



INTRODUCTION 



The purpose of this paper is to present a statistical procedure (denoted by SIB for 
simultaneous item bias) for detecting bias present in one or more test items of a standard- 
ized ability test. The procedure is based on the multidimensional item response theory 
(IRT) model of test bias presented in Shealy and Stout (1991). By "test bias" we mean 
a formalization of the intuitive idea that a test is less valid for one group of examinees 
than for another group in its attempt to assess examinee differences in a prescribed la- 
tent trait, such as mathematics ability. Test bias is conceptualized herein as the result of 
individually-biased items acting in concert through a test scoring method, such as number 
correct, to produce a biased test. 

Two distinct features of this conceptualization of bias are as follows. First, it provides 
a mechanism for explaining how several individually-biased items can combine through a 
test score to exhibit a coherent and major biasing influence at the test level. In partic- 
ular, this can be true even if each individual item displays only a minor amount of item 
bias. For example, word problems on a mathematics test that are too dependent on so- 
phisticated written English comprehension could combine to produce pervasive test bias 
against English-as-a-second-language examinees. A second feature, possible because of our 
multidimensional modeling approach, is that the underlying psychological mechanism that 
produces bias is addressed. This mechanism lies in the distinction made between the abil- 
ity the test is intended to measure, called the target ability, and other abilities influencing 
test performance that the test does not intend to measure, called nuisance determinants. 
Test bias will be seen to occur because of the presence of nuisance determinants possessed 
in differing amounts by different examinee groups. Through the presence of these nuisance 
determinants, bias then is expressed in one or more items. 

The test bias detection procedure can simultaneously assess bias in several items, 
thus addressing the above two features. In contrast, most item bias procedures detailed 
in the literature perform tests on a single item at a time: The pseudo IRT procedure 
of Linn and Harnish (1981) estimates possibly group-dependent item response functions 
(IRFs) without the use of item parameter estimation algorithms when the sample size is 
too small for their use. Thissen, Steinberg, and Wainer (1988) employ marginal maximum 
likelihood estimation to obtain group-dependent item parameters in a 3-parameter logistic 
framework and use the likelihood ratio test to test the equality of the parameters across 
group. The Mantel-Haenszel procedure, adapted for test response data by Holland and 
Thayer (1988), and which is in wide use, employs the practice of using the score of the 
entire test instead of the score of the non-studied items as the "matching criterion" to test 
for item bias. Other single-item procedures exist as well. Conceivably these procedures could be used once for each item in a set 
of items being tested for bias, and multiple comparison procedures could be employed to 
assess the hypothesis of the entire set being biased. However, if the amount of bias is small 
in each item, a multiple comparison procedure may not pick up bias in the set of items at 
all. Moreover, this approach cannot address underlying causal mechanisms of bias. 

The novelty of our approach to detecting test bias lies not so much with its recognition 
of the role of nuisance determinants in the expression of test bias, but rather in its explicit 
use of a multidimensional model to motivate the procedure to detect it. The presence of 
multidimensionality of test item responses where bias is present has long been recognized 
in test and item bias studies: Lord (1980) states "if many of the items [in a test] are found 
to be seriously biased, it appears that the items are not strictly unidimensional" (p. 220). 
Recently, Lautenschlager and Park (1988) employed a technique of generating simulated 
biased item responses using a method of Ansley and Forsyth (1985), which involves using 
multidimensional item response functions (IRFs) and latent ability distributions to determine 
conditional probabilities of correct response. Kok (1988), taking a multidimensional 
viewpoint similar to Shealy and Stout (1991), presents a specific multidimensional IRT 
model for bias where the nuisance determinants are compensating abilities, contextual 
abilities such as language, and testwiseness. 

An important issue addressed by our procedure is that a careful distinction is made be- 
tween genuine test bias, often operationally embodied as DIF (Holland and Thayer (1988)) 
by practitioners, and non-bias differences in examinee group performance, sometimes called 
impact (see, for example, Ackerman (1991) for a careful discussion of impact as distinct 
from bias), that are caused by examinee group differences in target ability distributions. 
It is important that the latter not be mistakenly labeled as test bias. The procedure 
developed herein makes this distinction in its application. 



FORMULATION OF TEST BIAS 



Test bias in this paper is modeled using a multidimensional item response theory 
(IRT) model, which is assumed to be the model behind the observed test responses. For 
purposes of exposition, we restrict ourselves to the case where there is a single nuisance 
determinant; this two-dimensional modeling approach is often realistic in practice. Exten- 
sions to multiple nuisance determinants are straightforward. For a fuller treatment of the 
conception of test bias, including the case of multiple nuisance determinants and item bias 
cancellation, in a more general framework, see Shealy and Stout (1991) and Shealy (1989). 

We consider two biologically- or sociologically-defined groups, named "reference" and 
"focal" groups (after Holland and Thayer's (1988) naming convention). A random sample 
of examinees is drawn from each group, and a test of N items is administered to them. 
Typically it is suspected that a part of the test is biased against the focal group; this 
group is usually the object of the bias study. The responses to the test items from a 
randomly-chosen examinee are denoted $\mathbf{U} = (U_1, \ldots, U_N)$, where each $U_i$ can take on 
0 or 1, according as the response to item i is incorrect or correct, respectively. 

The IRT model in general is composed of two components that generate $\mathbf{U}$: (1) a 
d-dimensional examinee ability parameter and (2) a set of item response functions (IRFs), one 
for each item, which determine the probability of correct response for the items. Here we 
restrict the model to have d = 1 or 2, because we are considering a single nuisance 
determinant in addition to the target ability. The ability vector is $(\theta, \eta)$ for an arbitrary examinee 
from either group, where $\theta$ denotes target ability and $\eta$ denotes the nuisance determinant. 
A distribution of $(\theta, \eta)$ over the combined group of examinees is induced by choosing 
examinees at random; the variable for a randomly chosen examinee is denoted $(\Theta, \eta)$. The 
IRF for item i is denoted $P_i(\theta, \eta)$, and it is assumed that all items depend on $\theta$, and one 
or more may depend on $\eta$; for those dependent only on $\theta$, the IRF is $P_i(\theta)$. It is implicitly 
assumed that an IRT representation for $\mathbf{U}$ in terms of $(\Theta, \eta)$ and $\{P_i(\theta, \eta) : i = 1, \ldots, N\}$ 
is possible; for a fuller treatment of this assumption, see Shealy (1989). In addition, it is 
assumed that each $P_i(\theta, \eta)$ is increasing in $(\theta, \eta)$ when item i is dependent on both abilities 
and increasing in $\theta$ when it is dependent on $\theta$ alone; and that each $P_i(\theta)$ is differentiable. 
Finally, local independence of $\mathbf{U}$ given $(\Theta, \eta)$ is assumed. 

Test bias in the above-mentioned model is formulated through three components: 

(a) The potential for bias, if it exists, resides within the target ability/nuisance determi- 
nant distributions of the two groups being studied; 

(b) potential for bias is expressed in items whose responses depend on the nuisance 
determinant;¹ and 

(c) the scoring method of the test, to be viewed as an estimate of target ability, transmits 
expressed item biases into test bias. 

¹ We remark that Kok's (1988) formulation is also based upon (a) and (b); Kok's and 
our formulation were developed independently of one another. 

Potential for test bias is explained prosaically in the following manner. After 
conditioning on a particular $\theta$, suppose that the reference group has a higher level of nuisance 
ability on average than the focal group. Then those reference group examinees with 
ability $\theta$ would have an overall advantage over the corresponding focal group examinees when 
responding to items at least partially dependent on the nuisance determinant $\eta$ (formally, 
because of the monotonicity of the item IRFs $P_i(\theta, \eta)$). Formally, we define the potential 
for test bias at $\theta$: 

Definition 1. Potential for test bias exists against the focal group at target ability level $\theta$ 
with respect to $\eta$ if $\eta \mid \Theta = \theta, G = F$ is stochastically less than $\eta \mid \Theta = \theta, G = R$, where 
"$G = F$" denotes sampling from the focal group and "$G = R$" sampling from the reference 
group. Potential for bias exists against the reference group if the converse holds. 

Note that we are restricting consideration to conditional nuisance distributions 
$\eta \mid \Theta = \theta, G = R$ and $\eta \mid \Theta = \theta, G = F$ that are stochastically ordered; that is, where the 
two distribution functions do not intersect. Figure 1 displays two distributions that are 
stochastically ordered and also two distributions that are not. 
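
As an illustration only, the stochastic ordering required above can be checked empirically by comparing two empirical distribution functions. The following is a minimal Python sketch under stated assumptions: the function names and the three small samples are hypothetical, and the check accepts distribution functions that touch as long as they never cross.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def stochastically_less(sample_a, sample_b):
    """True if the ECDF of sample_a is never below that of sample_b
    and is strictly above it somewhere (a tends to be smaller than b)."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    grid = sorted(set(sample_a) | set(sample_b))
    diffs = [cdf_a(x) - cdf_b(x) for x in grid]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)

# Hypothetical nuisance-ability samples at one fixed target ability level.
focal_eta = [-1.2, -0.5, 0.0, 0.4, 0.9]
reference_eta = [-0.7, 0.0, 0.5, 0.9, 1.4]   # focal values shifted up by 0.5
crossing_eta = [-2.0, -0.1, 0.0, 0.1, 2.0]   # same centre, more spread

print(stochastically_less(focal_eta, reference_eta))  # ordered pair
print(stochastically_less(focal_eta, crossing_eta))   # distribution functions cross
```

The second pair corresponds to the unordered case in Figure 1: neither sample is stochastically less than the other, so Definition 1 does not apply in either direction.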



place Figure 1 about here 



In order for test bias to occur, it must be expressed in one or more items. Our definition 
of expressed bias for an item, when specialized to Kok's model, is essentially the same as that 
of Kok (1988, p. 269). It is defined in terms of a marginalization of the multidimensional 
IRF $P_i(\theta, \eta)$. 

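Definition 2. Let $P_i(\theta, \eta)$ be the IRF for item i. The marginal IRF for group g (g = R 
or F) with respect to target ability $\theta$ is defined as 

$$T_{ig}(\theta) = E[P_i(\Theta, \eta) \mid \Theta = \theta, G = g]. \tag{1}$$

When $\eta \mid \theta$ has a conditional density for group g, $f_g(\eta \mid \theta)$ say, Definition 2 translates into 

$$T_{ig}(\theta) = \int_{-\infty}^{\infty} P_i(\theta, \eta)\, f_g(\eta \mid \theta)\, d\eta.$$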

Definition 3. Expressed bias for item i against the focal group occurs at target ability $\theta$ 
if $T_{iF}(\theta) < T_{iR}(\theta)$; it occurs against the reference group if the converse holds. 
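
To make Definitions 2 and 3 concrete, the marginal IRF can be approximated by numerical integration. The sketch below is illustrative only and rests on assumptions that are not part of the formulation above: a compensatory two-dimensional logistic IRF with guessing parameter 0.2, hypothetical discrimination values, and a normal conditional nuisance distribution whose mean differs by group.

```python
import math

def irf(theta, eta, a_theta=1.0, a_eta=0.5, b=0.0, c=0.2):
    """Assumed compensatory 2-D logistic IRF with guessing parameter c."""
    z = a_theta * theta + a_eta * eta - b
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def marginal_irf(theta, eta_mean, eta_sd=1.0, n_points=401, width=6.0):
    """Trapezoid-rule approximation of the integral of the IRF against
    an assumed Normal(eta_mean, eta_sd) density for eta given theta."""
    lo = eta_mean - width * eta_sd
    step = 2.0 * width * eta_sd / (n_points - 1)
    total = 0.0
    for j in range(n_points):
        eta = lo + j * step
        weight = step * (0.5 if j in (0, n_points - 1) else 1.0)
        total += weight * irf(theta, eta) * normal_pdf(eta, eta_mean, eta_sd)
    return total

# Hypothetical groups: the focal group has the lower conditional nuisance mean.
t_focal = marginal_irf(theta=0.0, eta_mean=-0.5)
t_reference = marginal_irf(theta=0.0, eta_mean=0.5)
print(t_focal < t_reference)  # expressed bias against the focal group at theta = 0
```

Because the assumed IRF is increasing in the nuisance dimension, the group with the stochastically smaller nuisance distribution receives the smaller marginal IRF, which is the situation Definition 3 labels as expressed bias against that group.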

A test can consist of many items simultaneously biased by the same nuisance determi- 
nant. In this case, items can cohere and act through the prescribed test score to produce 
substantial bias against a particular group even if individual items display undetectably 
small amounts of item bias. This is the final (and novel) component of our formulation of 
test bias mentioned above. We consider the large class of test scores of the form 

$$h(\mathbf{U}) \tag{2}$$

where $h(\mathbf{u})$ is real valued with domain $\mathbf{u} = (u_1, \ldots, u_N)$ such that $u_i = 0$ or 1 for 
$i = 1, \ldots, N$ and $h(\mathbf{u})$ is coordinatewise non-decreasing in $\mathbf{u}$. This class contains many of 
the standard scoring procedures for many standard models; for example, number correct, 
linear formula scoring of the form $\sum_{i=1}^{N} a_i U_i$, with $a_i > 0$, maximum likelihood estimation 
of ability for certain logistic models with item parameters assumed known, etc. In this 
paper we restrict attention to number correct as the test score; the results presented herein 
are easily extendable to other forms of $h(\mathbf{u})$. The key point about number correct scoring 
is that each item is weighted equally. Thus, if a subset of the items is suspected of bias, 
we should give equal weight to the items in this "studied" subtest in our attempt to 
quantitatively assess the amount of test bias resulting from the simultaneous influence of 
these items. We thus define test bias for a specified studied subtest of items as follows: 

Definition 4. Let $\{U_{i_1}, U_{i_2}, \ldots, U_{i_b}\}$ be any subtest of items to be studied for bias from 
the test of concern and define 

$$h(\mathbf{U}) = \sum_{j=1}^{b} U_{i_j}. \tag{3}$$

Then this studied subtest of items displays test bias against the focal group at $\theta$ if 

$$E[h(\mathbf{U}) \mid \Theta = \theta, G = F] < E[h(\mathbf{U}) \mid \Theta = \theta, G = R].$$

The subtest is biased against the reference group if the converse holds. 

Finally, the components of the bias formulation can be integrated using the following 
theorem, adapted from Theorem 4.2 in Shealy and Stout (1991): 

Theorem 1. Fix a target ability $\theta$ and choose the subtest scoring method $h(\mathbf{u})$ of the 
form (3). Assume potential for bias against the focal group at $\theta$ holds (Definition 1). Then 
test bias exists against the focal group; i.e., 

$$\sum_{j=1}^{b} E[U_{i_j} \mid \Theta = \theta, G = F] < \sum_{j=1}^{b} E[U_{i_j} \mid \Theta = \theta, G = R]. \tag{4}$$

In order to test for bias of the above form, there must be an implicit assumption that a 
portion of the test measures only the target ability; otherwise, a conditional-on-observed 
score procedure to detect bias is not possible. This set of items will be denoted the valid 
subtest. The issue of the existence and identification of a valid subtest is extremely difficult 
to frame philosophically (it is really an issue of construct validity) and must primarily be 
an empirical decision based on expert opinion or data at least in part external to the test 
being studied; it is not dealt with here. For a fuller discussion, see Shealy and Stout (1991). 
For notational simplicity we take the valid subtest to consist of the first n < N items of 
the test, and we call the remaining N - n items the studied subtest. We note that 
use of a valid subtest is operationally equivalent to making use of a subset of items whose 
purpose is to partition examinees into "comparable" sets as is done in the MH procedure 
described below and other DIF procedures. Hence, the proposed use of a valid subtest in 
the SIB procedure can be interpreted either in the strong sense of our test bias paradigm 
or in the weak sense of the DIF paradigm (of matching of "comparable" examinees). Thus 
use of our statistical procedure for assessing bias in no way requires acceptance of our bias 
framework as opposed to a "comparability" framework, where no claims about "bias" are 
made. 

Using the above conventions, the specification of test bias against the focal group at 
$\theta$ becomes 

$$T_F(\theta) \equiv \sum_{i=n+1}^{N} T_{iF}(\theta) < \sum_{i=n+1}^{N} T_{iR}(\theta) \equiv T_R(\theta) \tag{5}$$

because $T_{ig}(\theta) = E[U_i \mid \Theta = \theta, G = g]$ by a simple application of a standard conditioning 
formula to Definition 2. $T_g(\theta)$ is called the studied subtest response function for group g. 

Unidirectional test bias 

Test bias heretofore has been considered conditional on a single target ability; we now 
turn to a global perspective. If there is test bias against the same group for all $\theta$, then 
there is unidirectional bias against this group. Specifically, if 

$$B(\theta) = T_R(\theta) - T_F(\theta)$$

is the level of bias against Group F at $\theta$, then unidirectional bias holds if either $B(\theta) > 0$ 
for all $\theta$ or $B(\theta) < 0$ for all $\theta$. A strong form of unidirectional bias, termed uniform 
bias by Mellenbergh (1982), is the type of bias that the modified Mantel-Haenszel test 
statistic devised by Holland and Thayer (1988) is designed to detect. Although the 
Mantel-Haenszel approach is not dependent on an IRT framework, it can be put in a Rasch 
model IRT framework, with the single biased item having group-dependent item difficulties. 
Here, the bias is "uniform" in the sense that $T_F(\theta)$ is merely $T_R(\theta)$ shifted horizontally. 
Unidirectional bias is less restrictive in that $T_g(\theta)$ does not have to be a logistic IRF, and 
more importantly, $T_R(\theta)$ does not have to be $T_F(\theta)$ shifted. 

Since we are concerned with bias against the focal group, it is intuitive that a suitable 
theoretical unidirectional bias index is 

$$\beta_U = \int B(\theta)\, f_F(\theta)\, d\theta \tag{6}$$

where $f_F(\theta)$ is the probability density function of $\Theta$ for the focal group. Equivalent 
indices weighted by the reference target ability distribution and the combined-group target 
distribution are easily conceptualized. 
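
The role of the unidirectionality restriction in this index can be seen numerically. The following Python sketch is an illustration under assumptions that are not taken from the paper: a standard normal focal target ability density and two hypothetical bias functions, one constant and one that changes sign.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bias_index(bias_fn, focal_mean=0.0, focal_sd=1.0, n_points=801, width=8.0):
    """Trapezoid-rule approximation of the integral of bias_fn(theta)
    against an assumed Normal(focal_mean, focal_sd) focal density."""
    lo = focal_mean - width * focal_sd
    step = 2.0 * width * focal_sd / (n_points - 1)
    total = 0.0
    for j in range(n_points):
        theta = lo + j * step
        weight = step * (0.5 if j in (0, n_points - 1) else 1.0)
        total += weight * bias_fn(theta) * normal_pdf(theta, focal_mean, focal_sd)
    return total

# Hypothetical bias functions B(theta).
uniform_bias = bias_index(lambda theta: 0.1)                    # same sign everywhere
crossing_bias = bias_index(lambda theta: 0.1 * math.tanh(theta))  # changes sign at 0

print(round(uniform_bias, 6), round(crossing_bias, 6))
```

With a bias function that changes sign, the positive and negative regions cancel in the integral, so the index can be near zero even though bias is present at almost every ability level. This is why the index is presented as an index of unidirectional bias.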

THE BASIC PROCEDURE 

The statistical procedure to be presented is based on (6); the hypothesis is 

$$H: \beta_U = 0 \quad \text{vs.} \quad \beta_U > 0,$$

the alternative being one-sided to specifically test for bias against the focal group. The 
test statistic to be constructed is essentially an estimate of $\beta_U$ normalized to have unit 
variance. The estimate of $\beta_U$ is derived first. 

Since test bias is analyzed using number correct on the studied subtest, set 

$$Y = \sum_{i=n+1}^{N} U_i \tag{7}$$

to be the studied subtest score; also set $X = \sum_{i=1}^{n} U_i$ to be the valid subtest score. In 
selecting the valid subtest score to be number correct, we follow the convention set out in 
Holland and Thayer (1988), among many others. Other choices would of course be possible 
and could improve the performance of the procedure. 

The naive intuition is that examinees with the same valid subtest score are examinees 
of approximately equal target ability and thus such examinees are directly comparable in 
the assessment of bias. Thus the difference 

$$\bar{Y}_{Rk} - \bar{Y}_{Fk}, \quad k = 0, \ldots, n, \tag{8}$$

where $\bar{Y}_{gk}$ is the average Y for all examinees in group g attaining valid subtest score X = k, 
should provide a measure of the bias against the focal group (resulting from the reference 
group having superior nuisance ability $\eta$ on average). In particular, if there is no bias (H 
holds), then $\bar{Y}_{Rk} - \bar{Y}_{Fk} = 0$ for all k should be observed, and if there is unidirectional 
bias against the focal group ($B(\theta) > 0$ for all $\theta$) then $\bar{Y}_{Rk} - \bar{Y}_{Fk} > 0$ for all k, except for 
statistical error, should be observed. 

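The above assertion needs support; it will suffice to argue that 

$$E[\bar{Y}_{Rk} - \bar{Y}_{Fk}] = 0 \text{ for all } k \text{ if } B(\theta) = 0 \text{ for all } \theta, \text{ and}$$

$$E[\bar{Y}_{Rk} - \bar{Y}_{Fk}] > 0 \text{ for all } k \text{ if } B(\theta) > 0 \text{ for all } \theta. \tag{9}$$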

For now we restrict the target ability distributions to be equal for the two groups; i.e., 
$\Theta \mid G = R$ and $\Theta \mid G = F$ have the same distribution. It is easy to prove (following (5)) 
under the model presented herein that 

$$E[\bar{Y}_{gk}] = E[Y \mid X = k, G = g] = E[T_g(\Theta) \mid X = k, G = g]. \tag{10}$$

Now assume that the valid subtest is long enough so that the distribution of $\Theta \mid X = k$, 
$G = g$ is tightly concentrated about its mean, and hence that $T_g(\theta)$ is locally flat within 
the range of $\theta$ where the distribution of $\Theta \mid X = k, G = g$ mostly resides. Then 

$$E[T_g(\Theta) \mid X = k, G = g] \approx T_g(E[\Theta \mid X = k, G = g]) = T_g(E[\Theta \mid X = k]), \tag{11}$$

because the two target ability distributions are equal and expectation is a linear operator. 
Thus, denoting $\theta_k = E[\Theta \mid X = k]$, 

$$E[\bar{Y}_{Rk} - \bar{Y}_{Fk}] \approx B(\theta_k). \tag{12}$$

Thus (9) follows easily; the n + 1 differences in (8) provide an estimate of $B(\theta)$ at n + 1 
points in the $\theta$-domain. It is intuitive that an estimate of $\beta_U$ is 

$$\hat\beta_U = \sum_{k=0}^{n} \hat{p}_k (\bar{Y}_{Rk} - \bar{Y}_{Fk}), \tag{13}$$

where $\hat{p}_k$ is the proportion (among focal group examinees) attaining X = k. Specifically, 
if $J_{gk}$ is the number of examinees in group g attaining X = k, then $\hat{p}_k = J_{Fk} / \sum_{j=0}^{n} J_{Fj}$. 
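
A minimal Python sketch of this uncorrected estimate follows, for illustration only. The function names and the toy data are hypothetical, each examinee is represented as a pair (valid subtest score, studied subtest score), and the sketch simply drops score levels that are not observed in both groups, which is an assumption and not the exclusion rule of the Appendix.

```python
from collections import defaultdict

def cell_means(scores):
    """scores: list of (x, y) pairs. Returns {x: (count, mean of y)}."""
    cells = defaultdict(list)
    for x, y in scores:
        cells[x].append(y)
    return {k: (len(ys), sum(ys) / len(ys)) for k, ys in cells.items()}

def beta_hat_uncorrected(reference, focal):
    """Focal-weighted sum of differences in mean studied subtest score,
    taken over valid subtest score levels observed in both groups."""
    ref_cells, foc_cells = cell_means(reference), cell_means(focal)
    shared = sorted(set(ref_cells) & set(foc_cells))
    n_focal = sum(foc_cells[k][0] for k in shared)
    return sum(
        foc_cells[k][0] / n_focal * (ref_cells[k][1] - foc_cells[k][1])
        for k in shared
    )

# Toy data: (valid subtest score X, studied subtest score Y) per examinee.
reference = [(0, 1), (0, 1), (1, 2), (1, 2)]
focal = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 2), (1, 1)]
print(beta_hat_uncorrected(reference, focal))
```

In the toy data the reference group scores half a point higher on the studied subtest at each valid score level, so the weighted estimate is also one half regardless of how the focal examinees are distributed over the levels.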

In the case where the target ability distributions are the same, then, it is 
straightforward that 

$$E\hat\beta_U \approx \sum_{k=0}^{n} B(\theta_k)\, p_k, \tag{14}$$

where $p_k = P[X = k \mid G = F]$. Thus the expected value of $\hat\beta_U$ is a weighted difference 
of marginal IRFs, this weighted difference approximating $\beta_U$, which is a continuously 
weighted difference of marginal IRFs. From (14), it follows that $E\hat\beta_U = 0$ if $\beta_U = 0$, and 
$E\hat\beta_U > 0$ if $\beta_U > 0$. This suggests the standardized test statistic 

$$B = \frac{\hat\beta_U}{\hat\sigma(\hat\beta_U)} \tag{15}$$

for testing H, where the denominator is defined as 

$$\hat\sigma(\hat\beta_U) = \left( \sum_{k=0}^{n} \hat{p}_k^2 \left( \frac{1}{J_{Rk}} \hat\sigma^2(Y \mid k, R) + \frac{1}{J_{Fk}} \hat\sigma^2(Y \mid k, F) \right) \right)^{1/2}, \tag{16}$$

where $\hat\sigma^2(Y \mid k, g)$ is the sample variance of the studied subtest scores of those group g 
examinees with valid subtest score k. A full description of the computation of the test 
statistic, with contingencies for exclusion of certain valid subtest scores based on inadequate 
examinee counts, is presented in the Appendix. B is approximately standard normal when 
$\beta_U = 0$ and the target ability distributions are the same, because $\hat\beta_U$ is the weighted sum 
of approximately normal random variables $\bar{Y}_{Rk} - \bar{Y}_{Fk}$; these are approximately normal (for 
suitable sample sizes) by the central limit theorem (proof of asymptotic normality of B 
omitted). 
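
The computation of the standardized statistic can be sketched as follows. This Python fragment is illustrative only: the function name is hypothetical, the rule of keeping score levels with at least two examinees in each group is an assumption standing in for the Appendix's exclusion rules, and the regression correction introduced in the next section is not applied.

```python
import math
from collections import defaultdict
from statistics import mean, variance

def sib_statistic(reference, focal, min_count=2):
    """Return (beta_hat, B, one-sided p-value) from lists of (x, y) pairs,
    using focal-group weights and per-level sample variances."""
    cells = {"R": defaultdict(list), "F": defaultdict(list)}
    for x, y in reference:
        cells["R"][x].append(y)
    for x, y in focal:
        cells["F"][x].append(y)
    # Assumed rule: keep levels with at least min_count examinees per group.
    usable = sorted(
        k for k in cells["R"]
        if len(cells["R"][k]) >= min_count and len(cells["F"].get(k, [])) >= min_count
    )
    n_focal = sum(len(cells["F"][k]) for k in usable)
    beta_hat, var_hat = 0.0, 0.0
    for k in usable:
        y_ref, y_foc = cells["R"][k], cells["F"][k]
        p_k = len(y_foc) / n_focal
        beta_hat += p_k * (mean(y_ref) - mean(y_foc))
        var_hat += p_k ** 2 * (variance(y_ref) / len(y_ref) + variance(y_foc) / len(y_foc))
    if var_hat == 0.0:
        raise ValueError("no within-level variability; statistic undefined")
    b_stat = beta_hat / math.sqrt(var_hat)
    p_value = 0.5 * math.erfc(b_stat / math.sqrt(2.0))  # upper tail of N(0, 1)
    return beta_hat, b_stat, p_value
```

The one-sided p-value uses the upper tail of the standard normal distribution, matching the one-sided alternative of bias against the focal group.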

The regression correction for target ability difference 

The presence of a difference in target ability distributions in test bias studies has been 
treated in various contexts in the literature. The issue of the linking of metrics across group 
in the estimation of IRT item parameters is one such context (see Linn et al. (1981) for an 
IRT item bias approach where linking of metrics is crucial). Holland and Thayer (1988) 
also deal with this problem by including the single studied item in the matching criterion 
score of the Mantel-Haenszel test; they prove that this method completely compensates 
for target ability difference (in their context, the distributional difference in the postulated 
unidimensional latent trait) when the underlying IRT model is a Rasch model. Millsap 
and Meredith (1989) elegantly formulate the problem in terms of a divergence of two 
hypotheses (a "conditional on observed score" hypothesis and a "latent trait" hypothesis), 
which would occur if target ability difference is present. A "conditional on observed score" 
procedure such as (15) in its present form is not adequate to address the separation of 
target ability difference from test bias; the presence of target ability difference when in 
fact there is no test bias present can statistically inflate B, thereby suggesting test bias 
actually is present. It is therefore necessary to formulate a correction for target ability 
difference. 




To motivate the proposed correction it is necessary to show that a decomposition of the 
differences $\bar{Y}_{Rk} - \bar{Y}_{Fk}$ into "test bias only" and "target ability difference only" components 
is possible. First we note that by similar arguments to those used in deriving (10) and (11), 

$$E[\bar{Y}_{gk}] \approx T_g(\theta_{gk}), \tag{17}$$

where $\theta_{gk} \equiv E[\Theta \mid k, g]$. The condition $E[\bar{Y}_{Rk} - \bar{Y}_{Fk}] = 0$ requires $\theta_{Rk} = \theta_{Fk}$, as in (11) 
where g was removed from the conditioning; but this may not happen if the target ability 
distributions are not the same, as Figure 2 suggests. Figure 2, which displays densities 
for four distributions, assumes that the distribution of $\Theta \mid F$ is stochastically smaller than 
that of $\Theta \mid R$. 



place figure 2 about here 



Note that the (conditional) distribution of $\Theta \mid k, F$ is stochastically smaller than that 
of $\Theta \mid k, R$ for all k. The standard Bayesian calculation makes this insight rigorous. Thus, 
$\theta_{Fk} < \theta_{Rk}$ for all k, and, in the absence of bias, where $T_R(\theta) = T_F(\theta) = T(\theta)$ for all $\theta$, 

$$E\bar{Y}_{Fk} \approx T(\theta_{Fk}) < T(\theta_{Rk}) \approx E\bar{Y}_{Rk}$$

($T(\theta)$ is assumed monotone; for mild conditions giving such monotonicity, see Shealy and 
Stout (1991)). Thus 

$$E[\hat\beta_U] \approx \sum_{k=0}^{n} p_k \left( T(\theta_{Rk}) - T(\theta_{Fk}) \right) > 0.$$

In the case where bias is present, we can thus decompose $E[\hat\beta_U]$: 

$$E[\hat\beta_U] \approx \sum_{k=0}^{n} p_k \left( T_R(\theta_{Rk}) - T_F(\theta_{Rk}) \right) + \sum_{k=0}^{n} p_k \left( T_F(\theta_{Rk}) - T_F(\theta_{Fk}) \right)$$

$$= \sum_{k=0}^{n} p_k B(\theta_{Rk}) + \sum_{k=0}^{n} p_k T_F'(\theta_k^{*})(\theta_{Rk} - \theta_{Fk}), \tag{18}$$

where $\theta_k^{*}$ is between $\theta_{Rk}$ and $\theta_{Fk}$. ($T_F(\theta)$ is assumed differentiable here and the mean 
value theorem has been applied.) The first term is due only to test bias; the second is due 
only to target ability difference. 




This approximate decomposition argument is the motivation behind the proposed 
correction. Our strategy is to adjust $\bar{Y}_{Rk}, \bar{Y}_{Fk}$ to $\bar{Y}^{*}_{Rk}, \bar{Y}^{*}_{Fk}$ such that the inflating effect of 
the group differences in target ability is eliminated. This is accomplished by 
constructing $\bar{Y}^{*}_{Rk}$ and $\bar{Y}^{*}_{Fk}$ so that they estimate the studied subtest response functions 
$T_R(\theta)$ and $T_F(\theta)$ at approximately the same target ability $\bar\theta_k$ defined below (as opposed 
to two different ones, as is evident from (17)). 

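A natural attempt to make adjustments to $\bar{Y}_{Rk}$ and $\bar{Y}_{Fk}$ is to approximate $T_R(\theta)$ and 
$T_F(\theta)$ in the neighborhood of $\theta_{Rk}$ and $\theta_{Fk}$ by linear functions. If we assume that $\theta_{Rk}$ and 
$\theta_{Fk}$ are sufficiently close together to do this, $T_R(\theta)$ and $T_F(\theta)$ can be linearly interpolated 
at $\bar\theta_k = \frac{1}{2}(\theta_{Rk} + \theta_{Fk})$: 

$$T_g(\bar\theta_k) \approx T_g(\theta_{gk}) + m_{gk}(\bar\theta_k - \theta_{gk}) \tag{19}$$

where 

$$m_{gk} = \frac{T_g(\theta_{g,k+1}) - T_g(\theta_{g,k-1})}{\theta_{g,k+1} - \theta_{g,k-1}};$$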

however, though estimates of $T_g(\theta_{gk})$ (namely, $\bar{Y}_{gk}$) are available for all k, estimates for 
$\{\theta_{gk} : k = 0, \ldots, n\}$ are not. Abilities on the $\theta$-scale are not observable; however, one can 
estimate abilities on the scale defined by the valid subtest, namely 

$$v = \bar{P}(\theta)$$

where $\bar{P}(\theta) = \frac{1}{n} \sum_{i=1}^{n} P_i(\theta)$ is the average of the valid subtest IRFs. $\bar{P}(\Theta) \mid G = g$ is the 
true score for a randomly chosen group g examinee, i.e., the valid subtest true score $\bar{P}(\Theta)$ 
for group g. Let 

$$V_g(x) = E[\bar{P}(\Theta) \mid X = x, G = g], \tag{20}$$

the (theoretical) regression of true on observed (here, valid) score. $V_g(x)$ can be easily 
estimated using classical true score theory, assuming that the above regression is linear or 
nearly so. The estimation of $V_g(x)$ is deferred to the Appendix. Denote this estimator by 
$\hat{V}_g(x)$. 
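
For orientation, one classical true score estimate of a linear regression of true on observed score is Kelley's formula, which shrinks an observed score toward the group mean by the test reliability. The Python sketch below uses that formula with the KR-20 reliability coefficient; this choice is an assumption made for illustration, and the estimator actually used is the one given in the Appendix, which may differ. The function names and the toy response matrix are hypothetical.

```python
from statistics import mean, pvariance

def kr20(responses):
    """KR-20 reliability of a 0/1 response matrix (rows = examinees)."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    item_var = sum(pvariance([row[i] for row in responses]) for i in range(n_items))
    return n_items / (n_items - 1) * (1.0 - item_var / pvariance(totals))

def true_score_regression(responses):
    """Return x -> Kelley estimate of the proportion-correct true score
    for an observed valid subtest score x, within one examinee group."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    x_bar = mean(totals)
    reliability = kr20(responses)
    return lambda x: (x_bar + reliability * (x - x_bar)) / n_items

# Toy valid subtest responses for one group (4 examinees, 3 items).
group_responses = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
v_hat = true_score_regression(group_responses)
print(kr20(group_responses), v_hat(3))
```

Because the shrinkage is toward each group's own mean, two examinees with the same observed score but from groups with different mean scores receive different estimated true scores. That group dependence is what the correction described next exploits.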

At this point it is expedient to describe three latent scales, which must be simulta- 
neously considered in order to understand the correction. Figure 3 delineates the three 
scales and should be referred to frequently. 
So, the interpolation of (19) must be transformed so as to use the easily estimable 
$V_g(k)$ instead of $\theta_{gk}$. Through the monotonic transformation $\bar{P}(\theta)$, $V_g(k)$ and $\theta_{gk}$ represent 
approximately ("approximately" because $\bar{P}(\theta_{gk}) \approx V_g(k)$ will be demonstrated below) 
the same ability on two different latent scales and are thus for our purposes interchangeable. 
Note that $s = T_g(\theta)$ defines a monotonic transformation from the fundamental latent 
scale to the studied subtest scale, and $v = \bar{P}(\theta)$ defines one from the fundamental scale 
to the valid subtest scale. $T_g(\theta)$ must be transformed so we can use the valid subtest 
scale as domain, because abilities on this scale can be estimated. Figure 4 illustrates the 
appropriate correspondence, 
thus defining a new transformation $S_g(v) = T_g(\bar{P}^{-1}(v))$ from valid subtest scale to studied 
subtest scale, with domain (c, 1) and range (c, 1) (c > 0 is the guessing parameter, assumed 
common for all items in the test). 

With this transformation in hand, the correction can be performed in the following 
manner. First, by the same arguments as used in (10) and (11), using $\bar{P}(\theta)$ in place of 
$T_g(\theta)$ in the arguments, 

$$V_g(k) \approx \bar{P}(E[\Theta \mid k, g]) = \bar{P}(\theta_{gk}). \tag{21}$$

So $\bar{P}^{-1}(V_g(k)) \approx \theta_{gk}$ by continuity; and 

$$T_g(\bar{P}^{-1}(V_g(k))) \approx T_g(\theta_{gk}),$$

also by continuity. By definition of $S_g(v)$, this becomes $S_g(V_g(k)) \approx T_g(\theta_{gk})$, and thus 
by (17), 

$$E\bar{Y}_{gk} \approx S_g(V_g(k)). \tag{22}$$

Thus $\bar{Y}_{gk}$ is a reasonable estimator of $S_g(V_g(k))$ for each k. To transform (19) into 
an interpolation involving $S_g(v)$, we assume that $S_g(v)$ can be approximated by a linear 
function in a small region about $V_g(k)$, and that $V_R(k)$ and $V_F(k)$ are close enough to 
allow the approximation to be effective. Then, we interpolate $S_R(V_R(k))$ and $S_F(V_F(k))$ 
to their respective values at $\bar{V}_k = \frac{1}{2}(V_R(k) + V_F(k))$: 

$$S_g(\bar{V}_k) \approx S_g(V_g(k)) + m^{*}_{gk}(\bar{V}_k - V_g(k)) \tag{23}$$

where 

$$m^{*}_{gk} = \frac{S_g(V_g(k+1)) - S_g(V_g(k-1))}{V_g(k+1) - V_g(k-1)}$$

is the approximate slope of $S_g(v)$ in the region of $V_g(k)$ and $\bar{V}_k$. All of the above terms on 
the right hand side of (23) are estimable; using $\bar{Y}_{gk}$ to estimate $S_g(V_g(k))$, we define the 
adjusted $\bar{Y}^{*}_{gk}$: 

$$\bar{Y}^{*}_{gk} = \bar{Y}_{gk} + \hat{M}_{gk}(\hat{\bar{V}}_k - \hat{V}_g(k)) \tag{24}$$

where, recalling that the estimator $\hat{V}_g(x)$ is given in the Appendix, 

$$\hat{M}_{gk} = \frac{\bar{Y}_{g,k+1} - \bar{Y}_{g,k-1}}{\hat{V}_g(k+1) - \hat{V}_g(k-1)},$$

and define $\hat{\bar{V}}_k = \frac{1}{2}(\hat{V}_R(k) + \hat{V}_F(k))$. Because the right hand side of equation (24) is a good 
estimator of the right hand side of (23), $\bar{Y}^{*}_{gk}$ is thus a good estimator of $S_g(\bar{V}_k)$. Finally, $\bar{Y}^{*}_{gk}$ 
must be shown to be a good estimator of $T_g(\theta)$ at the same $\theta$ for both groups. By definition 
of $S_g(v)$, $S_g(\bar{V}_k) = T_g(\bar{P}^{-1}(\bar{V}_k))$. If $\theta_{Rk}$ and $\theta_{Fk}$ are sufficiently close together then $\bar{P}(\theta)$ 
may be taken to be approximately linear in the neighborhood of $\bar\theta_k = (\theta_{Rk} + \theta_{Fk})/2$. Thus, 
using (21) and assuming approximate linearity of $\bar{P}$ in the neighborhood of $\bar\theta_k$, 

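$$\bar{V}_k = \tfrac{1}{2}(V_R(k) + V_F(k)) \approx \tfrac{1}{2}(\bar{P}(\theta_{Rk}) + \bar{P}(\theta_{Fk})) \approx \bar{P}(\bar\theta_k).$$

Thus, by the continuity of $\bar{P}(\theta)$, 

$$\bar\theta_k \approx \bar{P}^{-1}(\bar{V}_k).$$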
Hence, by the definition of $S_g(v)$, 

$$S_g(\bar{V}_k) = T_g(\bar{P}^{-1}(\bar{V}_k)) \approx T_g(\bar\theta_k).$$

Thus, because $\bar{Y}^{*}_{gk}$ has been shown to be a good estimator of $S_g(\bar{V}_k)$, it is shown that 
$\bar{Y}^{*}_{gk}$ is a good estimator of $T_g(\bar\theta_k)$. Thus, $\bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk}$, as desired, is a good estimator of 
$T_R(\bar\theta_k) - T_F(\bar\theta_k)$, i.e., of the difference of the marginal IRFs at the same $\theta$, establishing 
the usefulness of the interpolation (19). 

(24) is called the regression correction for target ability difference. Thus, with the 
correction (24) in place, (13) can be reconstructed, with 

$$\hat\beta_U = \sum_{k=0}^{n} \hat{p}_k (\bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk}) \tag{25}$$

and B defined as in (15). Rejection of the hypothesis of no test bias ($H: \beta_U = 0$) occurs 
when $B > z_\alpha$, where $P[N(0, 1) > z_\alpha] = \alpha$ defines $z_\alpha$. This procedure will be referred to 
as the SIB procedure, "SIB" for simultaneous item bias. 
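
The regression correction can be sketched in a few lines of Python. The fragment below is an illustration under assumptions: the function name is hypothetical, the per-level means are supplied as dictionaries and the true score estimators as callables, and score levels without both neighbours k - 1 and k + 1 in both groups are simply skipped, which is not necessarily the rule of the Appendix.

```python
def adjusted_means(ybar_ref, ybar_foc, v_ref, v_foc):
    """Interpolate each group's mean studied subtest score at level k to the
    midpoint of the two groups' estimated true scores at that level.

    ybar_ref, ybar_foc: dict mapping level k -> mean studied subtest score.
    v_ref, v_foc: callables mapping level k -> estimated valid true score.
    Returns {k: (adjusted reference mean, adjusted focal mean)}.
    """
    adjusted = {}
    for k in sorted(set(ybar_ref) & set(ybar_foc)):
        neighbours = (k - 1, k + 1)
        if not all(j in ybar_ref and j in ybar_foc for j in neighbours):
            continue  # assumed rule: skip levels lacking both neighbours
        v_mid = 0.5 * (v_ref(k) + v_foc(k))
        pair = []
        for ybar, v in ((ybar_ref, v_ref), (ybar_foc, v_foc)):
            slope = (ybar[k + 1] - ybar[k - 1]) / (v(k + 1) - v(k - 1))
            pair.append(ybar[k] + slope * (v_mid - v(k)))
        adjusted[k] = tuple(pair)
    return adjusted
```

As a check on the logic, suppose there is no bias, both groups share one linear response function on the valid subtest scale, and the reference group's estimated true scores are uniformly higher. The raw differences at each level are then positive, while the adjusted differences vanish, which is the behaviour the correction is designed to produce.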

Thus, the contribution to the differences $\bar Y_{Rk} - \bar Y_{Fk}$ due to target ability difference
has been eliminated. It is instructive to note that the correction (24) is the
sample analogue of (23), which is essentially the decomposition (19), albeit on a different
latent scale (though the two latent scales, S and V, are indistinguishable up to a monotonic
transformation).

A modification of the basic procedure to achieve better statistical behavior 

Redefine $p_k$ to be the proportion of all examinees (focal and reference group) attaining
$X = k$. That is, $p_k = (J_{Fk} + J_{Rk}) / \sum_{l=0}^{n} (J_{Fl} + J_{Rl})$. Substitute this new $p_k$ into (25)
and (16) to obtain the statistic $B$ of (15). Because of a slightly better adherence in
simulation studies to the nominal level of significance when the hypothesis of no test bias
holds, this new choice of $p_k$ is recommended over the slightly more intuitive choice based
upon focal group examinees alone. The power performance of both versions of $B$ when
test bias was present was very similar. It is upon this version of the SIB statistic that our
simulation studies reported below are based.
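
The two weighting choices can be compared with a small Python sketch. The function name `score_weights` and the `pooled` flag are illustrative assumptions; the inputs are the counts of focal and reference examinees at each valid subtest score.

```python
def score_weights(j_focal, j_reference, pooled=True):
    """Return the weights p_k over valid-subtest scores k = 0, ..., n.

    pooled=True  : proportion of all examinees (both groups) at X = k
    pooled=False : proportion of focal group examinees at X = k
    """
    if pooled:
        counts = [f + r for f, r in zip(j_focal, j_reference)]
    else:
        counts = list(j_focal)
    total = sum(counts)
    return [c / total for c in counts]
```

For counts of (10, 30, 60) focal and (20, 40, 40) reference examinees, the pooled weights are (0.15, 0.35, 0.50), while the focal-only weights are (0.10, 0.30, 0.60).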

SIMULATION STUDY 

In order to assess the performance of the procedure in a variety of testing situations,
a moderate-sized simulation study (84 simulation cases) was performed. Three-parameter
logistic item parameters estimated from two real test data sets, an ACT math test
(estimated by Drasgow (1987)) and an ASVAB auto shop test (estimated by Mislevy and
Bock (1984)), were used to specify the IRFs in the IRT model. Univariate and bivariate
normal ability distributions, appropriately centered relative to the test item parameters
(for the purpose of good measurability of target ability), were used for the focal and reference
groups. Two levels of bias and three levels of target ability difference were simulated; tests
with a single biased item and with three biased items were used in the simulations. The level
of guessing in the tests was varied. Finally, group size pairs of (3000, 3000), (3000, 1000),
and (1500, 1500) for the reference group and focal group examinees, respectively, were used.

Each simulation model is run 100 times (trials). For a particular simulation model, the 
item parameters and the two ability distributions for the two groups are fixed; however, 
at each trial, a new set of examinees (ability parameters) is generated from the ability 
distributions. 

When a single item is studied in a simulation, the Mantel-Haenszel procedure as
modified by Holland and Thayer is run in parallel to provide an external benchmark
against which to compare our procedure.



Item parameters 

Estimated item parameters from the above mentioned tests were used to construct test
models; the ASVAB test length is 25, and the ACT test length is 40. Table 1 gives the
summary statistics for the a's, b's, and c's as estimated by Mislevy and Bock and by Drasgow;
for the actual parameter values, see Mislevy and Bock (1984) and Drasgow (1987).
The test for each simulation was generated in the following manner. Let $N$ denote
test length and $n_b$ the number of items to be studied for possible bias. First, $n_b$ was chosen
to be either 1 or 3. There were two cases to consider.

1. No bias: unidimensional items are used for the entire test. 

2. Bias: unidimensional items are used in the valid subtest, and 2-dimensional items are 
used in the studied subtest. 
In the first case, $n_b$ of the $N$ items were chosen randomly to be the studied ones, and
the remainder were used as the valid subtest. In the second case, $n = N - n_b$ items were
chosen at random from either the ASVAB or the ACT test to be the valid subtest, and
the 2-dimensional studied item parameters were chosen according to Table 2. Note that
the studied item guessing parameters are a function of the average and standard deviation
of the guessing parameters on the ASVAB or ACT tests; the studied item a's and b's are
the same for both tests.

The IRFs for case 1 (no bias) are

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp(-1.7\,a_{i\theta}(\theta - b_{i\theta}))}, \qquad i = 1, \ldots, N, \qquad (26)$$

where $a_{i\theta}$ and $b_{i\theta}$ are the target discrimination and difficulty for item $i$. In case 2 (bias),
items 1 to $n$ were of the form (26), and items $n + 1$ to $N$ (studied items) had IRFs

$$P_i(\theta, \eta) = c_i + \frac{1 - c_i}{1 + \exp(-1.7\,(a_{i\theta}(\theta - b_{i\theta}) + a_{i\eta}(\eta - b_{i\eta})))}, \qquad i = n + 1, \ldots, N. \qquad (27)$$

The final factor in determining the item parameters was whether or not to include guessing;
that is, whether to assume 2PL or 3PL modeling. The presence of guessing is thought
to influence the performance of the procedure. Thus, in some simulation models, the
estimated $c_i$'s from the literature were used in conjunction with (26) and (27); in others
all $c_i$'s were set to 0, producing a 2PL model. A detailed description of the experimental
design of the simulations follows.
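
The two item response functions can be written compactly in Python. This is a minimal sketch; the function names `irf_1d` and `irf_2d` and their argument names are illustrative assumptions. Setting `c = 0` gives the 2PL case described above.

```python
import math

D = 1.7  # scaling constant appearing in (26) and (27)


def irf_1d(theta, a, b, c=0.0):
    """Unidimensional logistic IRF of the form (26)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def irf_2d(theta, eta, a_theta, b_theta, a_eta, b_eta, c=0.0):
    """Two-dimensional logistic IRF of the form (27).

    theta is the target ability and eta the nuisance ability.
    """
    z = a_theta * (theta - b_theta) + a_eta * (eta - b_eta)
    return c + (1.0 - c) / (1.0 + math.exp(-D * z))
```

With the nuisance discrimination set to zero, `irf_2d` reduces to `irf_1d`, which is the sense in which the valid subtest items are unidimensional.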

Ability distributions 

Specifying the ability distributions involves choosing the five parameters determining 
the bivariate normal distributions for each group in such a way to meet the following goals: 

1. Introduce a specified amount of group difference between target ability distributions. 

2. Require the test to measure the target ability well, as would be true for any "good" 
test. 

3. Introduce a specified amount of potential for bias into the distributions. 

4. In the case of 2-dimensional studied items (bias case), require that examinee nuisance
abilities be influential in determining the response to the item, e.g., that focal and
reference group examinees have moderate nuisance abilities.
Each goal is elaborated upon separately below. The bivariate distribution for group $g$
($g = R$ or $F$) is denoted

$$\begin{pmatrix} \Theta \\ \eta \end{pmatrix} \Big|\; G = g \;\sim\; N\left( \begin{pmatrix} \mu_{\theta g} \\ \mu_{\eta g} \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \qquad (28)$$

where $\rho = \mathrm{Corr}(\Theta, \eta \mid G = g)$ is taken to be the same for both groups ($\rho$ taken to be
different across groups tends to introduce bidirectional bias, where marginal IRFs in $\theta$ for
the two groups cross; see Shealy (1989)). Note that $\sigma^2(\Theta \mid g)$ and $\sigma^2(\eta \mid g)$ are taken to
be 1 in our study.
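
Drawing examinees from a distribution of this form is straightforward. The sketch below uses only the Python standard library; the function name `draw_abilities` and the `seed` argument are illustrative assumptions. It builds the correlated pair from two independent standard normal draws, which gives unit variances and correlation $\rho$.

```python
import math
import random


def draw_abilities(n_examinees, mu_theta, mu_eta, rho, seed=None):
    """Draw (theta, eta) pairs from a bivariate normal distribution with
    unit variances, means (mu_theta, mu_eta) and correlation rho."""
    rng = random.Random(seed)
    root = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n_examinees):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        theta = mu_theta + z1
        # eta shares z1 with theta, which induces the correlation rho.
        eta = mu_eta + rho * z1 + root * z2
        pairs.append((theta, eta))
    return pairs
```

Calling the function once per group and per trial, with a fresh seed each time, corresponds to generating a new set of examinees at each trial while the distribution parameters stay fixed.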

Goal 1. We first define target ability difference. We need some notation: let $\alpha_R$ be
the proportion of the entire (conceptual) population of examinees who are reference group
members, and let $\alpha_F = 1 - \alpha_R$ be the corresponding proportion for the focal group. (Note:
as $J_R$ and $J_F$ both increase to $\infty$, conceptually, $J_R/(J_R + J_F) \to \alpha_R$ and $J_F/(J_R + J_F) \to \alpha_F$. Here $J_g$
denotes the number of sampled Group $g$ examinees.) Define

$$d_T = \frac{\mu_{\theta R} - \mu_{\theta F}}{\sigma_{\theta P}} \qquad (29)$$

to be the target ability difference between the focal and reference groups, where

$$\sigma^2_{\theta P} = \alpha_R\,\sigma^2(\Theta \mid R) + \alpha_F\,\sigma^2(\Theta \mid F). \qquad (30)$$

Note that when (28) holds, $\sigma^2_{\theta P} = 1$ and thus $d_T = \mu_{\theta R} - \mu_{\theta F}$. $d_T$ is a quantity
specified in the simulations.

Goal 2. The criterion used to ensure good measurability of $\theta$ by the test is that the
average difficulty ($\bar b$) of the valid subtest should be close to the average target ability over
the pooled groups. Specifically, $\mu_{\theta R}$ and $\mu_{\theta F}$ are chosen so that

$$\bar b = E[\Theta] = \alpha_R\,\mu_{\theta R} + \alpha_F\,\mu_{\theta F}. \qquad (31)$$

$\bar b$ is taken from Table 1. $\mu_{\theta R}$ and $\mu_{\theta F}$ are completely determined by specification of $d_T$
and (31).

Goal 3. We use a more restrictive version of Definition 1 to define potential for bias: set

$$C_\beta(\theta) = E[\eta \mid \Theta = \theta, G = R] - E[\eta \mid \Theta = \theta, G = F]. \qquad (32)$$

$C_\beta(\theta) > 0$ is defined to be the potential for bias against the focal group. When (28) holds,
(32) becomes

$$C_\beta(\theta) = C_\beta = \mu_{\eta R} - \rho\,\mu_{\theta R} - (\mu_{\eta F} - \rho\,\mu_{\theta F}) = (\mu_{\eta R} - \mu_{\eta F}) - \rho\,(\mu_{\theta R} - \mu_{\theta F}) = (\mu_{\eta R} - \mu_{\eta F}) - \rho\,d_T, \qquad (33)$$

$\theta$ dropping out because the ability correlation ($\rho$) is equal for both groups. Note that
because $C_\beta$ is constant for all $\theta$, unidirectional bias is being introduced. For a specified
amount of $C_\beta$, $\mu_{\eta R}$ and $\mu_{\eta F}$ are determined partially. The reader should note that potential
for bias can hold even though $\mu_{\eta R} = \mu_{\eta F}$, unless $\mu_{\theta F} = \mu_{\theta R}$.

Goal 4. The criterion used to ensure nuisance determinant influence is the following. The
nuisance difficulties for all studied items were chosen to be 0. For an arbitrarily chosen
target ability (say $\Theta = 0$) we thus want the average nuisance ability to be near 0 as well.
Thus we choose

$$E[\eta \mid \Theta = 0, G = R] = -E[\eta \mid \Theta = 0, G = F], \qquad (34)$$

i.e., the conditional nuisance expectation at $\Theta = 0$ is to be centered around the average
studied item nuisance difficulty of 0 for the reference and focal groups. Our intent in this
study was to introduce bias against the focal group, so $E[\eta \mid 0, R] > 0$ in (34) and thus we
get

$$0 < \mu_{\eta R} - \rho\,\mu_{\theta R} = -(\mu_{\eta F} - \rho\,\mu_{\theta F}); \qquad (35)$$

this will specify $\mu_{\eta R}$ and $\mu_{\eta F}$, along with specification of $C_\beta$ in (33).

There is an additional issue here: how large should $C_\beta$ be chosen to introduce a
"moderate" or "severe" amount of bias into the 2-dimensional studied items of Table 2?
This is treated below, in the experimental design of the study.

Goals 1-4 now completely specify (28): $\mu_{\theta R}$, $\mu_{\theta F}$, $\mu_{\eta R}$, and $\mu_{\eta F}$ can be found by
solving (29), (31), (33), and (35) simultaneously for them. $\rho$, $\sigma^2(\Theta \mid g)$, and $\sigma^2(\eta \mid g)$ are
chosen: $\rho = .5$, and all $\sigma$'s are 1.
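
With unit variances, the four conditions have a simple closed-form solution, sketched here in Python. The function name `solve_ability_means` and its arguments are illustrative assumptions, as is the default reference group proportion of 0.5. From (29) and (31), the target means are $\bar b + \alpha_F d_T$ and $\bar b - \alpha_R d_T$; combining (33) with (35) gives $\mu_{\eta g} - \rho\,\mu_{\theta g} = \pm C_\beta/2$.

```python
def solve_ability_means(d_t, b_bar, c_beta, rho=0.5, alpha_r=0.5):
    """Solve (29), (31), (33) and (35) for the four ability means,
    assuming unit variances in both groups.

    Returns (mu_theta_R, mu_theta_F, mu_eta_R, mu_eta_F).
    """
    alpha_f = 1.0 - alpha_r
    # (29) and (31): fix the difference and the pooled mean of target ability.
    mu_theta_r = b_bar + alpha_f * d_t
    mu_theta_f = b_bar - alpha_r * d_t
    # (35): conditional nuisance means at theta = 0 are +t and -t;
    # (33): their difference equals c_beta, so t = c_beta / 2.
    mu_eta_r = rho * mu_theta_r + c_beta / 2.0
    mu_eta_f = rho * mu_theta_f - c_beta / 2.0
    return mu_theta_r, mu_theta_f, mu_eta_r, mu_eta_f
```

For example, with $d_T = 0.5$, $\bar b = 0.09$, $C_\beta = 0.2$, $\rho = 0.5$ and equal group proportions, the sketch returns target means of 0.34 and -0.16 and nuisance means of 0.27 and -0.18.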

Choice of $C_\beta$

The amount of potential for bias $C_\beta$ in each simulation model was chosen so that the
actual level of bias $\beta_U$ produced was such that the power behavior of the statistic can be
well assessed for the given examinee sample sizes, valid subtest used (recall Table 1), and
biased items used (recall Table 2). These $\beta_U$ values (rounded to two significant figures)
are shown in Table 3. The governing equations determining $C_\beta$ from $\beta_U$ were

where 

$$T_g(\theta) = \sum_{i=n+1}^{N} E[P_i(\theta, \eta) \mid \Theta = \theta, G = g] \qquad (36)$$

with $P_i(\theta, \eta)$ defined in (27), the item parameters in (27) defined in Table 2, and the
parameters of the $(\Theta, \eta)$ distribution determined from (29), (31), (33), and (35). One
standard often used by practitioners to interpret the magnitude of the bias
is that the bias is "moderate" if $0.5 < \Delta_{MH} < 1$ and "large" if $\Delta_{MH} > 1$, where
$\Delta_{MH}$ is the theoretical index based on the Mantel-Haenszel log odds ratio proposed
by Holland and Thayer (1988). The rationales for $\Delta_{MH}$ and $\beta_U$ are different, but for $n_b = 1$
and unidirectional bias, they tend to be highly correlated and are crudely related by

Thus, roughly, $0.05 < \beta_U < 0.1$ would constitute moderate bias while $\beta_U > 0.1$ would
constitute large bias. Thus in the $n_b = 1$ case, referring to Table 4, the amount of bias
being simulated is actually either (low) moderate or small. Examination of (36) shows that
$\beta_U$ is a measure of how much lower the probability of getting the biased item right is for
an average focal group examinee as compared with an average reference group examinee
of the same target ability. Thus $\beta_U$ has a natural and useful empirical interpretation. In
our context, $\Delta_{MH}$, by contrast, is a measure of horizontal distance between $T_R(\theta)$ and
$T_F(\theta)$ at $y = (1 + \bar c)/2$ (i.e., the value of $T_F^{-1}((1 + \bar c)/2) - T_R^{-1}((1 + \bar c)/2)$), where $\bar c$ is defined
in Table 1.
Experimental design

The design is as follows. For the case of no test bias ($C_\beta = 0$), for each test type
(ASVAB Auto Shop or ACT Math) the following simulations are done:

$$n_b = \{1,\ 3\} \;\times\; d_T = \{0.0,\ 0.5,\ 1.0\} \;\times\; J_R/J_F = \{3000/3000,\ 3000/1000,\ 1500/1500\} \;\square\; \{\text{guessing},\ \text{no guessing}\}.$$

Here "guessing" means that the estimated ACT and ASVAB guessing parameters are used
in the model and "no guessing" means that all $c_i$'s are set to zero; that is, 2PL modeling
is used. Also, "$\square$" means that this guessing "factor" is randomly assigned within the
36 levels produced by crossing the other factors.

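For the case of test bias ($C_\beta > 0$) the following simulations are done for each test type:

$$n_b = \{1,\ 3\} \;\times\; d_T = \{0,\ 0.5\} \;\times\; C_\beta = \{0.2,\ 0.3\} \;\times\; J_R/J_F = \{3000/3000,\ 3000/1000,\ 1500/1500\},$$

with $\{\text{guessing},\ \text{no guessing}\}$ assigned within the resulting levels as in the no-bias case.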



For $n_b = 1$, the nuisance discrimination $a_{N\eta}$ of the studied item is .8; for $n_b = 3$, the
nuisance discrimination of each of the 3 studied items is .4. These discriminations were
chosen so that the power of the procedure could be well assessed (i.e., so that it would not
be too close to 1). It is informative to note in passing that the power of the procedure
is expected to be greater when $n_b$ is increased from 1 to 3 unless each item individually
displays less bias in the $n_b = 3$ case. This is why $a_{i\eta}$ ($i = N - 2, N - 1, N$) was chosen
to be .4 in the $n_b = 3$ case, one half of that used in the $n_b = 1$ case.

There are therefore 48 simulation models that incorporate bias. Thus, a total of 
84 simulation models were used in the simulation study. 
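
The count of 84 models can be checked by enumerating the crossed factors. The following Python sketch is only a bookkeeping aid; the constant and function names are illustrative assumptions. Guessing is not crossed with the other factors, since it is assigned within each cell.

```python
from itertools import product

TESTS = ("ASVAB Auto Shop", "ACT Math")
GROUP_SIZES = ((3000, 3000), (3000, 1000), (1500, 1500))  # (J_R, J_F)


def no_bias_models():
    """Cells of the no-bias design: test x n_b x d_T x group sizes."""
    return list(product(TESTS, (1, 3), (0.0, 0.5, 1.0), GROUP_SIZES))


def bias_models():
    """Cells of the bias design: test x n_b x d_T x C_beta x group sizes."""
    return list(product(TESTS, (1, 3), (0.0, 0.5), (0.2, 0.3), GROUP_SIZES))


if __name__ == "__main__":
    print(len(no_bias_models()), len(bias_models()),
          len(no_bias_models()) + len(bias_models()))
```

The enumeration gives 36 no-bias cells and 48 bias cells, 84 in total.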

RESULTS OF THE SIMULATION STUDY 

The results of the simulation study are given in Tables 5-12, with Tables 5-8
summarizing the no test bias simulations and Tables 9-12 summarizing the simulations
having test bias present. The c column indicates whether the model has guessing present
or not. In all $n_b = 1$ cases the Mantel-Haenszel rejection rate for the hypothesis of no item
bias (based on 100 trials) is reported in the MH column. In all cases the SIB rejection rate
is reported in the SIB column. In all cases where test bias is present (Tables 9-12), the
$C_\beta$ column presents the amount of potential for bias present (recall (33)); the $\beta_U$ column
presents our index of the amount of bias present against the focal group in the model
(recall (6)); the next column is the average of the estimates $\hat\beta_U$ of $\beta_U$ over the 100 trials; the $\Delta_{MH}$
column presents the amount of bias present against the focal group in the model from the
Mantel-Haenszel perspective.

Tables 5-8 indicate that both the SIB statistic and the MH statistic display reasonable
adherence to the nominal level of significance of 0.05. There appear to be situations of
no bias, which have a target ability difference and which depart from the Rasch model,
where the Mantel-Haenszel procedure displays inflated Type 1 error. (See Zwick (1990)
for a discussion of this problem and an illustrative example.) There is evidence
(Shealy (1989)) that in such situations the SIB statistic adheres closely to the nominal level
of significance. On the other hand, there are likely portions of the "parameter space"
of realistic IRT models where our linear regression correction is stressed and hence
MH would likely display better Type 1 error performance. More study is required before
it can be claimed that either MH or SIB displays superior Type 1 error performance.
The striking fact is that both procedures seem to be quite robust against the inflating
Type 1 error effect of differing target ability distributions. In this regard, $d_T = 1$ is,
from the practitioner's perspective, certainly a large amount of target ability difference.

Tables 9 and 11 indicate that both the SIB statistic and the MH statistic are quite 
powerful against moderate amounts of bias and fairly powerful against small amounts of 
bias in a single biased item. Untabulated simulation studies for larger amounts of bias 
produced rejection rates of essentially unity for both the SIB and MH procedures. 

Tables 10 and 12 indicate that the SIB procedure is quite powerful against moderate
amounts of bias resulting from several (3 here) items producing bias in the same direction.
The reader should recall that the amount of bias per item was lowered for the $n_b = 3$ case by
reducing the discrimination in the nuisance dimension from $a_{N\eta} = 0.8$ to $a_{i\eta} = 0.4$ for the
studied items. In both the $n_b = 1$ and $n_b = 3$ cases, the potential for bias as measured
by $C_\beta$ was kept the same ($C_\beta = 0.2$ or 0.3). These two tables show, as claimed, that the
SIB procedure can successfully detect simultaneous item bias, even if the amount of bias
present per item is small.

Tables 9 and 11 show, for the particular bias models of the simulation study, that SIB 
is somewhat more powerful than MH, averaging 0.07 higher for those models for which 
rejection rates are < 0.9. We do not know whether this greater SIB power generalizes to 
other models of bias. 

Tables 9-12 provide evidence about the ability of $\hat\beta_U$ to estimate $\beta_U$, our measure of
the amount of bias present. For each case, the difference between the average estimate and
$\beta_U$ is an indicator of the amount of statistical bias one might expect in using $\hat\beta_U$.
Clearly statistical bias of roughly +0.01 is present.
The estimated standard errors for $\hat\beta_U$ are not recorded, but averaged (roughly) about 1/3
of $\beta_U$. Thus if $\hat\beta_U = 0.05$ there is likely a bias of 0.01 and a standard error of 0.017. Thus,
crudely, a 95% confidence interval (if asymptotic normality is a good approximation) would
be given by 0.04 ± 0.028. Here 0.04 = 0.05 - 0.01 is the correction for statistical bias. It
would seem that $\hat\beta_U$ provides a useful empirical index of the amount of bias present in a
studied subtest of items; more work is planned in studying its theoretical and empirical
properties.
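
The arithmetic behind this crude interval can be checked with a short Python sketch. The function name and arguments are illustrative assumptions. With a standard error of 0.017, a half-width of 0.028 corresponds to a normal multiplier of about 1.645, whereas the conventional two-sided 95% multiplier of 1.96 would give roughly 0.033.

```python
def bias_corrected_interval(estimate, statistical_bias, standard_error, z):
    """Return (center, half_width) of a normal-theory interval after
    subtracting an assumed statistical bias from the estimate."""
    center = estimate - statistical_bias
    half_width = z * standard_error
    return center, half_width


if __name__ == "__main__":
    for z in (1.645, 1.96):
        center, half = bias_corrected_interval(0.05, 0.01, 0.017, z)
        print(f"z = {z}: {center:.2f} +/- {half:.3f}")
```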

SUMMARY AND CONCLUSIONS 

The SIB procedure was designed to test for unidirectional test bias residing in one or
more items, using the conception that test bias is incipient within the two groups' ability
distributions (in terms of a difference in conditional nuisance ability distributions). By
means of the regression correction presented here, the inflation of the SIB test statistic
due to target ability difference (one group having a stochastically larger distribution of $\Theta$)
is removed. This correction represents a conceptual link between conditional-on-observed-score
methods and IRT-based methods, just as the practice of including the studied item
in the comparable examinee criterion in the Mantel-Haenszel procedure of Holland and
Thayer (1988) does. The correction adjusts the studied subtest scores for the two groups so
that they are estimates of the same latent IRT ability in the case of no test bias, even if
group target ability differences exist. It is useful to note that the adjustment, although conceptually
based upon multidimensional IRT modeling, is in fact computed using a classical approach
and hence does not depend on IRT ability or item parameter estimation.

A moderate (84 models) simulation study shows that both MH and SIB display good
adherence to the nominal level of significance, even for large ($d_T = 1$) target ability
differences. In the case of a single biased item, both MH and SIB display good power, with SIB
displaying slightly higher power. As designed, the SIB statistic displays good power in the
case of several biased items (3 here), even when the amount of bias per item is fairly small.

A large-scale simulation study is in progress with the goal of obtaining a better
understanding of the performance characteristics of both the SIB and the MH statistics, with
particular emphasis on investigation of statistical power and adherence to the nominal
level of significance. Based upon the completed portion of this simulation study reported
herein, we would recommend that practitioners use the SIB and MH statistics together.
Both are easy to compute and, for moderate-sized data sets, run quickly on
a typical PC configuration. Carefully checked code with a user-oriented driver is available
from the authors for running both the SIB and MH statistics on real data sets and also
for doing simulation studies of performance.
APPENDIX

1. Derivation of the estimated regression of true on observed valid
subtest score, for $k = 0, \ldots, n$.

Recall that $V_g(k) = E[P(\Theta) \mid k, g]$ needs to be estimated in order for $S_g(V_k)$ of (23)
to be estimated. Suppressing $g$ for simplicity, we need to estimate $V(k)$ at $k = 0, 1, \ldots, n$.
Although $V(k)$ is not necessarily linear in $k$ (see Shealy (1989), p. 87ff for a discussion),
as an approximation we assume $nV(k)$ is linear in $k$; i.e.,

$$nV(k) = \alpha + \beta k.$$

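To estimate $V(k)$, we consider the true score model for the valid subtest score $X$:

$$X = T + e \qquad (A1)$$

where

$$E(e) = 0, \qquad \mathrm{cov}(T, e) = 0 \qquad (A2)$$

is assumed and the true score $T$ has the latent variable representation $T = nP(\Theta)$. Thus

$$nV(k) = E[T \mid k].$$

Standard regression theory for $E(T \mid k)$ yields

$$V(k) = \frac{1}{n}\left(ET + \rho_{XT}\,\frac{\sigma_T}{\sigma_X}\,(k - EX)\right). \qquad (A3)$$

But, for the true score model given by (A1) and (A2),

$$\rho_{XT}\,\frac{\sigma_T}{\sigma_X} = 1 - \frac{\sigma^2(e)}{\sigma^2(X)} \qquad (A4)$$

is well known (see page 61 of Lord and Novick (1968)). Using (A1) and (A2), $ET = EX$
holds. Thus, by (A3) and (A4),

$$V(k) = \frac{1}{n}\left(EX + \left(1 - \frac{\sigma^2(e)}{\sigma^2(X)}\right)(k - EX)\right) \qquad (A5)$$

holds.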

Clearly $EX = E[X \mid g]$ can be estimated by the average valid subtest score $\bar X_g$
of all Group $g$ examinees taking the test. Thus it remains to estimate $\sigma^2(e)/\sigma^2(X)$.
$\sigma^2(X) \equiv \sigma^2(X \mid g)$ can clearly be estimated by the usual sample variance estimate of all
Group $g$ examinees taking the test,

$$\hat\sigma^2(X \mid g) = \frac{1}{J_g - 1} \sum_{j=1}^{J_g} (X_{gj} - \bar X_g)^2, \qquad (A6)$$

where $J_g$ denotes the number of Group $g$ examinees taking the test and $X_{gj}$ is the valid
subtest number correct score of the $j$th such Group $g$ examinee. It remains to estimate
$\sigma^2(e)$; denote this estimator by $\hat\sigma^2(e)$. Then the desired estimator of $\sigma^2(e)/\sigma^2(X)$ will be
given by $\hat\sigma^2(e)/\hat\sigma^2(X)$. A standard conditioning formula yields, indexing the valid subtest
items by $i = 1, 2, \ldots, n$, and setting $X_g = X \mid g$, $\Theta_g = \Theta \mid g$ as a reminder that sampling
here is from Group $g$ only,

$$\sigma^2(X \mid g) = \sigma^2(X_g) = \sigma^2(E[X_g \mid \Theta_g]) + E[\sigma^2(X_g \mid \Theta_g)]
= \sigma^2(nP(\Theta_g)) + \sum_{i=1}^{n} E[P_i(\Theta_g)(1 - P_i(\Theta_g))], \qquad (A7)$$

using the standard item response theory assumption of local independence of items, given $\Theta$.
Also, by (A2) it is trivial that

$$\sigma^2(X \mid g) = \sigma^2(nP(\Theta) \mid g) + \sigma^2(e \mid g).$$

Thus, by (A7),

$$\sigma^2(e \mid g) = \sum_{i=1}^{n} E[P_i(\Theta_g)(1 - P_i(\Theta_g))].$$

This suggests

$$\hat\sigma^2(e \mid g) = \sum_{i=1}^{n} \bar U_{gi}(1 - \bar U_{gi}), \qquad (A8)$$

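where $\bar U_{gi}$ is the proportion correct for Group $g$ examinees for valid subtest item $i$. Thus,
using (A5), we will estimate $V_g(k)$ by

$$\hat V_g(k) = \frac{1}{n}\left(\bar X_g + \left(1 - \frac{\hat\sigma^2(e \mid g)}{\hat\sigma^2(X \mid g)}\right)(k - \bar X_g)\right). \qquad (A9)$$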
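
The estimator (A9) needs only classical quantities, as the following Python sketch illustrates. The function name `estimate_v` and the list-of-lists input format are illustrative assumptions, and the sample variance is computed with a divisor of $J_g - 1$, which is also an assumption. No adjustment for guessing is applied here.

```python
def estimate_v(responses):
    """Estimate V_g(k), k = 0, ..., n, following (A6), (A8) and (A9).

    responses : list of 0/1 lists, one per examinee of a single group,
                covering the n valid subtest items only.
    """
    j = len(responses)
    n = len(responses[0])
    scores = [sum(row) for row in responses]
    x_bar = sum(scores) / j
    # (A6): sample variance of the valid subtest score (J - 1 divisor assumed).
    var_x = sum((x - x_bar) ** 2 for x in scores) / (j - 1)
    # (A8): error variance from classical item difficulties.
    p_item = [sum(row[i] for row in responses) / j for i in range(n)]
    var_e = sum(p * (1.0 - p) for p in p_item)
    slope = 1.0 - var_e / var_x
    # (A9): regression of true score on observed score, on the 0-1 scale.
    return [(x_bar + slope * (k - x_bar)) / n for k in range(n + 1)]
```

The function assumes at least two examinees with differing scores, since otherwise the sample variance in the denominator is zero.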



2. The complete procedure to detect test bias, using the proposed regression
correction.

The SIB procedure in its entirety is presented here. First we set some basic notation.
Group $g$ ($g = R$ or $F$) has $J_g$ examinees taking the test of $N$ items. The response to item $i$
of the $j$th group $g$ examinee is $U_{gij}$. The subtest scores are

$$X_{gj} = \sum_{i=1}^{n} U_{gij} \ \text{(valid subtest score)}, \qquad Y_{gj} = \sum_{i=n+1}^{N} U_{gij} \ \text{(studied subtest score)}.$$

The classical group item difficulties are $\bar U_{gi} = (1/J_g) \sum_{j=1}^{J_g} U_{gij}$. Let $\sum_{j:\,X_{gj} = k}$ denote
summation over those group $g$ examinees $j$ with $k$ correct on the valid subtest.

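1. Compute $J_{gk}$, the number of group $g$ examinees with $k$ correct on the valid subtest.

2. Compute

$$\bar Y_{gk} = \frac{1}{J_{gk}} \sum_{j:\,X_{gj} = k} Y_{gj}, \qquad S^2_{gk} = \frac{1}{J_{gk} - 1} \sum_{j:\,X_{gj} = k} (Y_{gj} - \bar Y_{gk})^2.$$

If $J_{gk} = 0$, set $\bar Y_{gk} = 0$; if $J_{gk} \le 1$, set $S^2_{gk} = 0$. $\bar Y_{gk}$ is the sample average studied
subtest score of group $g$ examinees attaining $X_g = k$, and $S^2_{gk}$ is the sample variance.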

3. Compute $\hat P_g(k) = J_{gk} / J_g$ for both groups and all $k$. $\hat P_g(k)$ is the estimate of the
histogram of $X \mid G = g$. Then compute $\tilde P_g(k)$, the MLE of the unimodalized histogram
of $X \mid G = g$, over the class of all possible unimodal histograms with $n + 1$
possible values ($X \mid G = g$ is assumed to have a unimodal distribution and hence its
estimate $\{\tilde P_g(k), k \ge 0\}$ should also be unimodal). For details of this procedure, using
the up-and-down-blocks algorithm, see Barlow et al. (1972; pp. 72-73; pp. 223-231).

4. Set $I(k) = 1$ for all $k$ unless either

(a) $k = 0$ or $n$,

(b) $S^2_{Rk} = 0$ or $S^2_{Fk} = 0$,

(c) $J_R \tilde P_R(k) < J_{min}$ or $J_F \tilde P_F(k) < J_{min}$, where $J_{min}$ is set by the user, usually around 30,
or

(d) $k < n c_U$, where $c_U \ge 0$ is the user-specified global guessing parameter for the
test. (It is assumed that there is a relatively constant level of guessing across
items, and that there is at least partial knowledge of this guessing value.)

$I(k)$, $k = 0, \ldots, n$, is the examinee inclusion indicator; it is 1 if examinees with
$X = k$ are to have their responses included in the test statistic, and 0 otherwise. (a) excludes the two
extreme valid subtest scores because of their poor estimation of target ability. The
(b) exclusion is needed because the variance estimates enter the denominator of the test
statistic. The (c) exclusion is done to assure that each valid subtest
score category has enough examinees to make $\bar Y_{Rk}$ and $\bar Y_{Fk}$ approximately normal; the
unimodal mass function is used so that only extreme valid subtest score categories are
excluded. As for (d), all valid scores below that expected by guessing are excluded.

5. Compute the regression of true score on valid subtest score: 

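(a) $\bar U_{gi} \leftarrow (\bar U_{gi} - c_U)/(1 - c_U)$, where $c_U$ is the user-specified global guessing parameter of step 4(d). If the result is < 0, set it to 0 (adjustment for guessing).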

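(b) $\bar X_g = \frac{1}{J_g} \sum_{j=1}^{J_g} X_{gj}$

(c) $\hat\sigma^2(X \mid g) = \frac{1}{J_g - 1} \sum_{j=1}^{J_g} (X_{gj} - \bar X_g)^2$

(d) $\hat\sigma^2(e \mid g) = \sum_{i=1}^{n} \bar U_{gi}(1 - \bar U_{gi})$

(e) $\hat b_g = \frac{n}{n-1}\left(1 - \frac{\hat\sigma^2(e \mid g)}{\hat\sigma^2(X \mid g)}\right)$

(f) $\hat V_g(k) = \frac{1}{n}\left(\bar X_g + \hat b_g (k - \bar X_g)\right)$ for both $g$ and $k = 0, \ldots, n$.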
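6. Make the regression correction:

(a) $k_l = \min\{k : I(k) = 1\}$, $k_r = \max\{k : I(k) = 1\}$.

(b) $\hat V_k = \tfrac{1}{2}(\hat V_R(k) + \hat V_F(k))$, $k_l \le k \le k_r$.

(c) For $k_l < k < k_r$, compute

$$\hat M_{gk} = \frac{\bar Y_{g,k+1} - \bar Y_{g,k-1}}{\hat V_g(k+1) - \hat V_g(k-1)}.$$

Then compute $\bar Y^*_{gk} = \bar Y_{gk} + \hat M_{gk}(\hat V_k - \hat V_g(k))$.

(d) For $k = k_l$ and $k = k_r$, compute $\bar Y^*_{gk}$ in the following way.
i. Define

$$\hat S_g(v) = \begin{cases} (1 - \alpha)\,\bar Y_{g,k+1} + \alpha\,\bar Y_{gk} & \text{if } \hat V_g(k) < v \le \hat V_g(k+1), \\ \bar Y_{gn} & \text{if } v > \hat V_g(n), \end{cases}$$

and

$$\alpha = \frac{\hat V_g(k+1) - v}{\hat V_g(k+1) - \hat V_g(k)}.$$

$\hat S_g(v)$ is the linear interpolation of $\{\bar Y_{g0}, \ldots, \bar Y_{gn}\}$.
ii. Compute

$$\bar Y^*_{gk} = \hat S_g(\hat V_k)$$

for $k = k_l$ and $k = k_r$.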
7. Compute the bias statistic. 

(a) Compute $J^*_g = \sum_{k=0}^{n} I(k) J_{gk}$, the number of included group $g$ examinees.

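(b) Compute

$$B = \frac{\sum_{k=0}^{n} I(k)\,p_k\,(\bar Y^*_{Rk} - \bar Y^*_{Fk})}{\left(\sum_{k=0}^{n} I(k)\,p_k^2 \left(\frac{S^2_{Rk}}{J_{Rk}} + \frac{S^2_{Fk}}{J_{Fk}}\right)\right)^{1/2}}.$$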

(c) Reject $H_0: \beta_U = 0$ in favor of $\beta_U > 0$ at level $\alpha$ if $B > z_\alpha$, where $P[N(0,1) > z_\alpha] = \alpha$ defines $z_\alpha$.
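
Steps 2 and 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the weights are the pooled proportions renormalized over the included score levels, and the variance term is the usual one for a weighted sum of differences of independent means. The adjusted averages are taken as given.

```python
import math


def level_stats(valid_scores, studied_scores, n):
    """Step 2: counts, means and sample variances of the studied subtest
    score at each valid subtest score k = 0, ..., n, for one group."""
    counts, means, variances = [], [], []
    for k in range(n + 1):
        ys = [y for x, y in zip(valid_scores, studied_scores) if x == k]
        j = len(ys)
        mean = sum(ys) / j if j > 0 else 0.0
        var = sum((y - mean) ** 2 for y in ys) / (j - 1) if j > 1 else 0.0
        counts.append(j)
        means.append(mean)
        variances.append(var)
    return counts, means, variances


def sib_statistic(ystar_r, ystar_f, var_r, var_f, j_r, j_f, include):
    """Step 7: return (beta_hat, B) over the included score levels.

    Included levels are assumed to have nonzero counts in both groups.
    """
    ks = [k for k, flag in enumerate(include) if flag]
    total = sum(j_r[k] + j_f[k] for k in ks)
    beta_hat = 0.0
    variance = 0.0
    for k in ks:
        p_k = (j_r[k] + j_f[k]) / total  # pooled weight (assumed form)
        beta_hat += p_k * (ystar_r[k] - ystar_f[k])
        variance += p_k ** 2 * (var_r[k] / j_r[k] + var_f[k] / j_f[k])
    return beta_hat, beta_hat / math.sqrt(variance)
```

The returned value of B would then be compared with the upper normal critical value, as in step 7(c).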



Table 1: Means and sds for the ASVAB and ACT item parameters used in the study.

Test              mean(a)  sd(a)  mean(b)  sd(b)  mean(c)  sd(c)  N
ASVAB auto/shop   1.22     0.7    0.09     0.72   0.20     0.06   25
ACT math          1.09     0.35   0.5      0.61   0.14     0.04   40



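Table 2: Item parameters for the 2-dimensional studied items in the bias case.

n_b  Item No.  a (target)  b (target)  a (nuisance)  b (nuisance)  c
1    N         1.0         0.0         0.8           0.0           $\bar c$
3    N-2       0.6         -0.3        0.4           0.0           $\bar c - \tfrac{1}{2}\sigma_c$
3    N-1       0.8         0.0         0.4           0.0           $\bar c$
3    N         1.0         0.3         0.4           0.0           [illegible]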





Table 3: Equivalence table for bias potential and actual test bias.

n_b  C_beta  a (nuisance)  beta_U
1    0.0     -             0
1    0.2     0.8           0.03
1    0.3     0.8           0.05
3    0.0     -             0
3    0.2     0.4           0.06
3    0.3     0.4           0.09



Table 4: Equivalence of $\Delta_{MH}$ and $\beta_U$ when $n_b = 1$, using item parameters of Table 2.

C_beta  c's used    Delta_MH  beta_U
0.0     -           0         0
0.2     0.0         .27       0.034
0.2     actual c's  .27       0.026
0.3     0.0         .40       0.051
0.3     actual c's  .39       0.039



Table 5: No bias, ACT, $n_b = 1$, $\alpha = 0.05$.

J_F   J_R   c  d_T  MH   SIB
1500  1500  0  .0   .03  .07
1000  3000  0  .0   .00  .02
3000  3000  c  .0   .09  .06
1500  1500  0  .0   .04  .04
1000  3000  c  .5   .10  .10
3000  3000  c  .5   .05  .03
1500  1500  c  1.0  .02  .05
1000  3000  c  1.0  .05  .10
3000  3000  0  1.0  .06  .09



Table 6: No bias, ACT, $n_b = 3$, $\alpha = 0.05$.

J_F   J_R   c  d_T  SIB
1500  1500  0  .0   .05
1000  3000  0  .0   .02
3000  3000  c  .0   .07
1500  1500  0  .5   .08
1000  3000  c  .5   .07
3000  3000  0  .5   .05
1500  1500  c  1.0  .06
1000  3000  c  1.0  .16
3000  3000  0  1.0  .09



Table 7: No bias, ASVAB, $n_b = 1$, $\alpha = 0.05$.

J_F   J_R   c  d_T  MH   SIB
1500  1500  0  .0   .08  .07
1000  3000  0  .0   .04  .04
3000  3000  c  .0   .06  .06
1500  1500  0  .5   .13  .14
1000  3000  c  .5   .04  .03
3000  3000  c  .5   .05  .04
1500  1500  c  1.0  .07  .02
1000  3000  c  1.0  .15  .09
3000  3000  0  1.0  .11  .01
Table 8: No bias, ASVAB, $n_b = 3$, $\alpha = 0.05$.

J_F   J_R   c            d_T          SIB
1500  1500  [illegible]  [illegible]  [illegible]
1000  3000  [illegible]  [illegible]  [illegible]
3000  3000  c            [illegible]  [illegible]
1500  1500  [illegible]  [illegible]  [illegible]
1000  3000  [illegible]  .5           .06
3000  3000  0            .5           .05
1500  1500  c            1.0          .15
1000  3000  c            1.0          .07
3000  3000  0            1.0          .04



Table 9: Bias, $a_\eta = 0.8$, ACT, $n_b = 1$, $\alpha = 0.05$.

J_F   J_R   c  d_T  C_beta  beta_U  mean est.  Delta_MH  MH   SIB
1500  1500  c  0    .2      .026    .032       .27       .46  .58
1000  3000  0  0    .2      .032    .042       .27       .64  .70
3000  3000  0  0    .2      .032    .035       .27       .91  .95
1500  1500  c  .5   .2      .029    .035       .27       .51  .60
1000  3000  0  .5   .2      .034    .044       .27       .65  .72
3000  3000  0  .5   .2      .034    .038       .27       .91  .94
1500  1500  0  0    .3      .048    .052       .40       .84  .90
1000  3000  c  0    .3      .042    .053       .40       .87  .91
3000  3000  c  0    .3      .042    .045       .40       .97  1.00
1500  1500  0  .5   .3      .050    .047       .40       .99  .99
1000  3000  c  .5   .3      .042    .054       .40       .80  .84
3000  3000  c  .5   .3      .042    .064       .40       .91  .92



Table 10: Bias, $a_\eta = 0.4$, ACT, $n_b = 3$, $\alpha = 0.05$.

J_F   J_R   c  d_T  C_beta  beta_U  mean est.  SIB
1500  1500  0  0    .2      .063    .069       .70
1000  3000  c  0    .2      .053    .067       .68
3000  3000  c  0    .2      .053    .053       .80
1500  1500  c  .5   .2      .055    .071       .60
1000  3000  0  .5   .2      .065    .083       .72
3000  3000  0  .5   .2      .065    .074       .96
1500  1500  0  0    .3      .093    .095       .91
1000  3000  0  0    .3      .093    .11        .89
3000  3000  c  0    .3      .080    .081       .99
1500  1500  0  .5   .3      .097    .12        .97
1000  3000  c  .5   .3      .084    .11        .89
3000  3000  c  .5   .3      .083    .09        1.00
Table 11: Bias, $a_\eta = 0.8$, ASVAB, $n_b = 1$, $\alpha = 0.05$.

J_F   J_R   c  d_T  C_beta  beta_U  mean est.  Delta_MH  MH   SIB
1500  1500  c  0    .2      .026    .029       .27       .42  .50
1000  3000  0  0    .2      .034    .039       .27       .63  .79
3000  3000  0  0    .2      .034    .034       .27       .90  .95
1500  1500  c  .5   .2      .027    .035       .27       .63  .66
1000  3000  0  .5   .2      .034    .038       .27       .63  .70
3000  3000  0  .5   .2      .034    .036       .27       .89  .91
1500  1500  0  0    .3      .051    .052       .40       .85  .92
1000  3000  c  0    .3      .042    .044       .40       .77  .84
3000  3000  c  0    .3      .042    .046       .40       .99  .99
1500  1500  0  .5   .3      .051    .057       .40       .91  .93
1000  3000  c  .5   .3      .038    .048       .40       .77  .82
3000  3000  c  .5   .3      .039    .045       .40       .94  .97



Table 12: Bias, $a_\eta = 0.4$, ASVAB, $n_b = 3$, $\alpha = 0.05$.

J_F   J_R   c  d_T  C_beta  beta_U  mean est.  SIB
1500  1500  0  0    .2      .065    .067       .70
1000  3000  c  0    .2      .052    .056       .53
3000  3000  c  0    .2      .052    .053       .85
1500  1500  c  .5   .2      .052    .068       .63
1000  3000  0  .5   .2      .064    .083       .73
3000  3000  0  .5   .2      .064    .072       .92
1500  1500  0  0    .3      .098    .10        .94
1000  3000  0  0    .3      .097    .10        .97
3000  3000  c  0    .3      .079    .079       .98
1500  1500  0  .5   .3      .097    .011       .98
1000  3000  c  .5   .3      .076    .093       .87
3000  3000  c  .5   .3      .078    .090       .99